Predicting Traffic Incident Casualty Severity

Road Safety Dataset

As noted in the project proposal, finding a dataset with accurate and sufficiently complete data is a challenge in itself, so the choice of dataset can itself be considered a source of bias. As a student it is near impossible to certify rules for data completeness, or to build a dataset yourself by tracking real-world data on the internet; such a task is not feasible in the short term.

The United Kingdom has a more open and established data source [1], having tracked traffic incidents since 2006. A single year of data consists of three separate sets.

  1. Characteristics of the incident
  2. Vehicles involved
  3. Casualties

This data can give an accurate description of the situation, the involved parties and the resulting damages. Most importantly, it covers the target variable casualty_severity.

To start, each of the datasets will be imported.
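As a sketch, the import could be wrapped in a small helper. The file names here are assumptions loosely based on the DfT naming scheme and would need to be adjusted to the actual download:

```python
import pandas as pd

def load_year(year):
    """Load the three files for one year of UK road safety data.

    The file names are hypothetical -- the real download from
    data.gov.uk may use a different scheme.
    """
    characteristics = pd.read_csv(f"dft-road-casualty-statistics-accident-{year}.csv")
    vehicles = pd.read_csv(f"dft-road-casualty-statistics-vehicle-{year}.csv")
    casualties = pd.read_csv(f"dft-road-casualty-statistics-casualty-{year}.csv")
    return characteristics, vehicles, casualties
```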

Here we can observe an interesting trait: the row count of each dataset. An order of grouping can already be established.

  1. Place
  2. Vehicles
  3. Casualties

Let's take a look at each dataset and its features so we can properly confirm this theory.

Inspecting datasets

Each of these datasets already comes partly categorized: certain features are stored as numbers that can be decoded with the Reference Table under each dataset.

Characteristics

As specified previously, this dataset concerns the descriptive features of each accident, mostly where and when it took place. Other features such as road type, junction details and weather conditions can also be observed; these might show an interesting correlation with the casualty severity in each case.

An important thing to notice is the accident_index. This index allows cross-referencing with the other datasets.

Reference Table

Casualty

This dataset contains the features of greatest importance, as they describe the casualty and, above all, the severity of the injuries. It also holds some descriptive features about the person and the role they played in traffic during the incident.

Reference Table

Vehicle

The vehicle section holds information about the type and actions of each vehicle. What was the vehicle doing, and who was inside? These are questions that can be answered with this dataset, which can be linked through the accident_index.

Missing Data & Types

Before we start displaying and linking data, we can check for missing datapoints and for features that can be excluded, provided it can be presumed they would not have a significant impact on the accuracy of the model.
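A quick way to find such features is the per-column null percentage. A minimal sketch on a toy frame (only the column names follow the real data; the values and missingness are made up):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the vehicles dataset: column names follow the
# real data, the values and missingness are made up.
vehicles = pd.DataFrame({
    "age_of_driver": [25, np.nan, 40, np.nan, 33, 51, np.nan],
    "engine_capacity_cc": [1600, np.nan, 1998, 1200, np.nan, 1400, 900],
})

# Percentage of null values per feature, rounded to whole percents
null_pct = (vehicles.isnull().mean() * 100).round().astype(int)
print(null_pct)
```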

From this we can observe some irregularities:

Characteristics

  1. junction_control has 41% null values. It can be excluded, as it only adds detail to junction_detail, which is considered sufficient.
  2. trunk_road_flag has 7% null values. It indicates who manages the road network on which the event took place, and can be excluded as another added descriptive feature.

Vehicle

  1. age_of_driver has 14% null values. Most of these cases fall around the same percentile area. This could be because of the problem described in the proposal: fleeing the scene.
  2. age_band_of_driver has 14% null values. This is directly linked to age_of_driver.
  3. engine_capacity_cc has 26% null values. It seems that in some circumstances descriptive features of the vehicle were left empty; the same applies to numbers 4, 5, 6, 7 and 8 on this list. These could be linked, but this is not certain, as the other features are present. For now they do not account for a large part of the data, so we will leave them in.
  4. propulsion_code has 25% null values.
  5. age_of_vehicle has 25% null values.
  6. generic_make_model has 28% null values.
  7. driver_imd_decile has 18% null values.
  8. driver_home_area_type has 18% null values.

Casualty

  1. casualty_home_area_type has 9% null values. This can be directly related to the previous cases. There might be specific circumstances in which these fields do not apply, or they fall into their own specific category for which no law enforcement is applicable. These are small cases, but we need more insight to decide whether they can be excluded.
  2. casualty_imd_decile has 9% null values.

To see whether the data is usable for modeling, we need to check the types, as most classification algorithms take float- or int-like values.

As we can see, most data is float or int, because it is already categorized and can be decoded with the reference table. Yet some object columns remain; we need to inspect these and the other features to check whether they can be used for modeling.
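One way to spot the columns that are not yet model-ready is to filter on the object dtype. A small sketch with made-up rows:

```python
import pandas as pd

# Made-up rows; only the column names follow the real data
df = pd.DataFrame({
    "accident_index": ["2020010001", "2020010002"],  # object (string) key
    "accident_severity": [3, 2],                     # already numeric
    "date": ["01/01/2020", "02/01/2020"],            # object, converted later
})

# Columns still stored as Python objects are not yet model-ready
object_cols = df.select_dtypes(include="object").columns.tolist()
print(object_cols)
```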

Feature Selection

In order to find usable features we need to observe their distributions. We can pick out points of interest and examine them further against the reference tables.

Characteristics

First we check what kinds of objects are in each dataset.

What we observe here is mostly reference information for the accident, and the local authority it took place in. These can be excluded, as we have geographical data that can also describe this.

What's interesting is the date. Let's observe the distribution of these accidents over time.

In order to do this we first need to convert the date object into a datetime format.

As we can see, we have the total count of accidents per day, but they are not ordered yet and are still stored as objects. Let's convert and sort them.
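The conversion and sorting can be sketched as follows, on toy day-first dates like those in the UK data:

```python
import pandas as pd

# Toy date column in the day-first format used by the UK data
df = pd.DataFrame({"date": ["02/01/2020", "01/01/2020", "01/01/2020"]})

# Convert the object column to a proper datetime (day-first)
df["date"] = pd.to_datetime(df["date"], dayfirst=True)

# Count accidents per day, then sort chronologically
daily = df["date"].value_counts().sort_index()
# daily.plot() would now draw accidents per day on a time axis
print(daily)
```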

Now we can plot them in a graph.

The remaining set of features is large, so we will first take a look at the general distribution of each feature, after which a selection of features can be examined more closely.

Interests

We can display the results on a map to check the concentration using longitude and latitude.

What is not visible on this map is the actual distribution of some of these features, so we can take a closer look at some points of interest. We see a high concentration of cases near London, which corresponds to the Metropolitan Police force. Let's check the distribution of cases:

Here we can indeed see number 1 having the most cases. Through the reference table we find that this is the Metropolitan Police.

Here we can see the police forces with the most incidents. On the map it is not immediately visible how large the difference in cases actually is, but the more you zoom in, the higher the concentration becomes. It would be interesting to see what the severity rates are.

We observe that only a small number of cases end in a fatality. This seems fair, as there are hundreds of small accidents each day and not many of them are fatal. Another point of interest is the distribution of these cases per police force. We would expect the areas with the most cases to also have the most deaths.

Here we observe that the Metropolitan Police do not have the highest number of deaths; that is in fact Police Scotland. This could be linked to there being more dangerous roads in Scotland compared to the city streets in and around London.

Here we further see that most cases happen on single carriageways. These are usually highly trafficked road networks connecting various places.

Finally we can confirm our theory that each set of characteristics is linked to one or more casualties. These tables can later be merged so the data is enriched with features from both.

Removable Features

In the end we saw some features which can be better described by another feature, or which are not usable because of their lack of information or correlation.

time, local_authority_ons_district, local_authority_highway, lsoa_of_accident_location, accident_year, location_easting_osgr, location_northing_osgr, second_road_class, second_road_number, road_surface_conditions, trunk_road_flag

Casualty

Here we observe only accident_reference being an object. It will later be used to link casualties to the characteristics. The other features will be displayed and described as a group.

Interests

Here we actually observe a lower count of fatal accidents. This can be attributed to the fact that accident_severity is based on the grouped outcome of the accident, while casualty_severity is the actual outcome per casualty. This would explain the high amount of slight severity and the lower amount of fatal cases.

Removable Features

accident_year, pedestrian_road_maintenance_worker

Vehicles

For vehicles we again observe accident_reference, and also the model of the car. This could be an interesting feature for car manufacturers who would like to see which models have the highest fatality rates. As we saw before, this feature is null for 28% of datapoints. These might be cases involving pedestrians though, so let's take a look at that.

As expected, we see that a lot of datapoints are -1. We can use vehicle_reference from the casualty dataset after concatenating to check the distribution. For now, here are the remaining features.

Removable Features

accident_year

A better look needs to be taken at some features after concatenation:

sex_of_driver, age_of_driver, age_band_of_driver, engine_capacity_cc, propulsion_code, age_of_vehicle

Cleaning Data

We had a look at the data and found some irregularities, which will now be processed. As many features as possible have been kept for the best possible prediction. Later, the most important features will be displayed and further selected by applying a Random Forest Classifier to check the stability of the data.

Concatenating Datasets

For this we will observe a case with 2 casualties and 13 vehicles, focusing on a few features of importance.

Based on the accident_index, all vehicle datapoints can be enriched with the characteristics data; number_of_vehicles serves as the check for this.

Again the accident_index will be the main key, but vehicle_reference will be used for ordering casualties.

Here we see the vehicle_reference that will be used. Both casualties were in the first vehicle. Further, we can see that the age_of_casualty values are 73 and 40. Referencing that against the vehicles dataset, we find that age_of_driver was 40.

Now the two datasets will be merged with the keys accident_index and vehicle_reference. This way each casualty can be linked to a vehicle.

We observe a decline in datapoints, because not every vehicle involved in an accident had a casualty. Checking back on our previous example, we see that the data transferred over precisely: the information about the vehicle and passengers matches.

Next we merge the characteristics for each casualty case. This way we obtain an accurate description about the situation from each casualty.
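Both merge steps can be sketched with pandas on toy frames that mirror the key structure of the real datasets (all values here are made up):

```python
import pandas as pd

# Toy frames mirroring the key structure of the real datasets
vehicles = pd.DataFrame({
    "accident_index": ["A1", "A1", "A2"],
    "vehicle_reference": [1, 2, 1],
    "age_of_driver": [40, 25, 60],
})
casualties = pd.DataFrame({
    "accident_index": ["A1", "A1"],
    "vehicle_reference": [1, 1],
    "casualty_severity": [3, 2],
})
characteristics = pd.DataFrame({
    "accident_index": ["A1", "A2"],
    "weather_conditions": [1, 2],
})

# Each casualty gets the data of the vehicle it was in...
merged = casualties.merge(vehicles, on=["accident_index", "vehicle_reference"])
# ...and then the description of the accident itself
merged = merged.merge(characteristics, on="accident_index")
print(merged.shape)
```

Vehicles without a casualty drop out of the result, which matches the decline in datapoints described above.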

The dataset retains its shape, with added features. We can now investigate further and finally check which features can be removed.

Here we have the final dataset we will be working with. The final step before modelling is preprocessing, in which non-numerical values will be categorically encoded into numerical values.

Final usable dataset. This dataset could be exported as a .csv file and saved in a cloud environment for general-purpose use. In this case this will not be done, as the modelling and evaluation take place in this document.

Target Variable

Based on the concatenation of the datasets, we now have descriptive features for each case of a casualty within a traffic accident. Now we can make a plan for predicting this variable. First let's visualize its distribution. We previously saw that not many cases involve fatal injuries, so the data may be distributed unevenly and be hard to use in actual machine learning.

Here we observe a clearly uneven distribution of the data. It would be difficult to accurately estimate the severity for the rarer classes because of the lack of cases. For the model to learn, the class distribution has to be even, so it gains an equal understanding of situations involving each type of severity.

To achieve this, down- and oversampling will be applied. These take random cases of each class and remove or duplicate them based on a target count; the end goal is to lower or raise the count for each severity while keeping the distribution equal.
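One way to implement this is `sklearn.utils.resample`; here is a sketch that oversamples every class up to the size of the largest one, on a toy frame with the same kind of imbalance:

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame with the same kind of class imbalance as the real data
df = pd.DataFrame({
    "feature": range(10),
    "casualty_severity": [3] * 7 + [2] * 2 + [1],
})

# Oversample every severity class up to the size of the largest one
largest = df["casualty_severity"].value_counts().max()
balanced = pd.concat(
    resample(group, replace=True, n_samples=largest, random_state=42)
    for _, group in df.groupby("casualty_severity")
)
print(balanced["casualty_severity"].value_counts())
```

Downsampling is the same call with `replace=False` and the size of the smallest class as `n_samples`.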

By downsampling the data, the model will have fewer instances of each case to learn from. Thus the chance of a wrong prediction will be greater, as two close examples may be confused in favour of one side.

On the side of oversampling there might be too much repeated information, so the model might always select the outcome based on a previously seen combination.

In order to check for irregularities, from this point on all 3 datasets will be evaluated. This way we can check what impact down- or oversampling has on the accuracy.

Machine Learning

One of the most used algorithm types for classification is the decision tree. To see whether this data can be of any benefit, a test will be made using the Random Forest Classifier.

The data first has to be split into the target variable and the usable features. A comparison will be made between the oversampled, downsampled and normal datasets.

First the target variable has to be removed from the datasets.

Splitting the data into training and test sets.

Training models based on the 3 datasets.

Now we can do our first predictions.
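The steps above, from separating the target to the first prediction, can be sketched end to end. Synthetic data stands in for the merged dataset here:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the merged dataset
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 5, size=(200, 4)),
                  columns=["a", "b", "c", "casualty_severity"])
df["casualty_severity"] = df["casualty_severity"] % 3 + 1  # classes 1-3

# 1. Remove the target variable from the features
X = df.drop(columns="casualty_severity")
y = df["casualty_severity"]

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 3. Train a forest and make the first predictions
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
pred = model.predict(X_test)
acc = accuracy_score(y_test, pred)
print(acc)
```

The same fit/predict steps would be repeated for the normal, downsampled and oversampled datasets.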

Here we observe what we expected: the accuracy increases with oversampling and decreases with downsampling. Still, an accuracy of 97% seems very high, so there might be some kind of bias or instability involved.

First let's check some more statistics from the model.
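A classification report gives per-class precision and recall; a minimal sketch with made-up labels (1 = fatal, 2 = serious, 3 = slight):

```python
from sklearn.metrics import classification_report

# Made-up true and predicted labels (1=fatal, 2=serious, 3=slight)
y_true = [3, 3, 3, 2, 2, 1, 3, 2, 1, 3]
y_pred = [3, 3, 2, 2, 3, 1, 3, 2, 3, 3]

# Precision, recall and F1 per severity class
print(classification_report(y_true, y_pred, zero_division=0))
```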

We can observe that with the low number of cases, there is not much certainty about the severity. Especially when predicting fatality: it is accurate only 0.023 of the time. Comparing this to the oversampled results, we observe a big difference.

With a much richer dataset, there is now barely any doubt: we see that in almost all cases the results are predicted accurately. To make sure there is no bias, we can take a look at the most important features.
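With scikit-learn the importances come straight from the fitted forest; a sketch on synthetic data, where the toy target is driven entirely by age so that feature should dominate:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the toy target depends only on age_of_casualty,
# so that feature should dominate the importances
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "age_of_casualty": rng.integers(1, 90, 300),
    "vehicle_type": rng.integers(1, 20, 300),
    "weather_conditions": rng.integers(1, 9, 300),
})
y = (X["age_of_casualty"] > 60).astype(int)

model = RandomForestClassifier(random_state=42).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```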

Here we see that age_of_casualty has a high importance. This could be linked to the fact that younger drivers might be more dangerous, or that older drivers are more vulnerable. After that we mostly observe features from the vehicles dataset, pointing to the fact that certain types of vehicles have higher casualty rates.

Finally, there are no features which could bias the prediction of the model, for example a feature implying that a car with 5 passengers always has a casualty.

We can observe that the recall for slightly injured is low, meaning this is where the model is still unsure. At this stage it is hard to estimate what other features could improve the accuracy.

Comparing Different Algorithms

As we have previously seen, it is possible to predict the casualty severity in a traffic accident quite accurately using the Random Forest Classifier. But what about other algorithms? Is there a chance to predict it more accurately? We will start again by splitting the dataset into training and testing sets.

Training Models

Display Random forest logic

Grid search -> SVC or Random Forest -> Cross-validation
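A minimal version of this comparison, using cross-validation on synthetic data in place of the full grid search:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic classification problem standing in for the real data
X, y = make_classification(n_samples=200, n_features=6, random_state=42)

# Mean 5-fold cross-validation score per candidate model
scores = {
    "RandomForest": cross_val_score(
        RandomForestClassifier(random_state=42), X, y, cv=5).mean(),
    "SVC": cross_val_score(SVC(), X, y, cv=5).mean(),
}
print(scores)
```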

From the results we can conclude that the Random Forest Classifier scores best, so we continue by optimizing it further, to see whether even better results are possible.

Optimizing Random Forest
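One way to do this tuning, matching the 5 iterations mentioned in the conclusion, is a randomized search over a small parameter grid. The grid below is hypothetical and the data synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the real training data
X, y = make_classification(n_samples=200, n_features=6, random_state=42)

# Hypothetical grid -- the real notebook may tune other parameters
param_distributions = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,       # 5 random parameter combinations, as in the notebook
    cv=3,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```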

Conclusion

Based on the research above, we can conclude that for this case the Random Forest Classifier is the top pick, generating great results with the oversampled data.

Some more extensive data cleaning can be done to remove the remaining features that have little to no impact on the prediction to further improve the score.

Another improvement which could be made is combining data from multiple years. This would have been difficult to compute, so the choice was made to evaluate data from a single year. With more computing power it would be possible to get a better result by combining all the years.

Also, some more extensive hyperparameter tuning could be done; running it for more iterations could improve the accuracy. This is not done in this notebook because it is very time-consuming. The choice was made to go for 5 iterations, which take around 10 minutes.

References

[1] UK Gov. Road Safety Data. (n.d.). Retrieved November 27, 2021, from https://data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data.

[2] What are the different types of road in the UK? Bituchem. (2021, January 27). Retrieved November 28, 2021, from https://www.bituchem.com/knowledge-hub/what-are-the-different-types-of-road-in-the-uk/.

[3] Service, G. D. (2015, April 5). Speed limits in the UK. GOV.UK. Retrieved November 28, 2021, from https://www.gov.uk/speed-limits#:~:text=National%20speed%20limits,there%20are%20signs%20showing%20otherwise.

[4] Highways England. GOV.UK. (n.d.). Retrieved November 28, 2021, from https://www.gov.uk/government/organisations/highways-england.

[5] The English index of multiple deprivation (IMD) 2015 guidance. (n.d.). Retrieved November 28, 2021, from https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/464430/English_Index_of_Multiple_Deprivation_2015_-_Guidance.pdf.